Introduction

Mental illnesses – disorders that may impact a person’s thoughts, perceptions, feelings, and behaviors – have been shown to affect 10.7% of the world’s population, disrupting lives and contributing to 14.3% of deaths worldwide. However, despite the widespread nature of mental illness and the severe consequences of declining mental health, the subject remains highly stigmatized and difficult to broach in the workplace.

To better understand mental health in the face of these challenges, this project considers various aspects of the individual as well as their workplace environment in order to predict whether working adults will seek treatment for mental health issues.

Inspiration

As I conclude my second year of college, I find myself anxious about the future. The thought of facing such uncertainty is daunting, and it has led me to start exploring what it means to be in the work environment. Having grown up in a generation that is hyper-aware of mental health and at the center of driving progress toward the social acceptance of discussing it, I have come to realize that support for mental illness in the workplace is significant to me.

The topic of mental health is an undeniably important one, and by fitting various machine learning models to data collected about mental health in the workplace, I hope to better understand what work environment best supports and encourages individuals to care for their mental well-being. Using these findings, employees may be able to find the workplace that best suits their needs, and employers may learn how to better support employees and accommodate those needs.

Understanding Our Data

Our data is the “Mental Health in Tech Survey” dataset sourced from Kaggle (https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey) by Open Sourcing Mental Illness, LTD. It is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.

We will be using the vast majority of data from this dataset in order to predict the variable treatment. A more in-depth description of each variable can be found in codebook.txt, located in the data folder.

Load the Data

We begin by loading in our data so that we can work with it. Displayed below are the first six rows of our data. We make sure to use the clean_names() function so that the variable names in our data are in a consistent format, which makes them easier to use and refer to throughout the project.
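
Note that the code throughout this project assumes a handful of packages have already been loaded. A typical setup chunk (my best guess at the packages used; the original setup is not shown) would look something like this:

# packages assumed throughout this project
library(tidyverse)    # readr, dplyr, ggplot2, tibble, ...
library(tidymodels)   # rsample, recipes, parsnip, workflows, tune, yardstick
library(janitor)      # clean_names()
library(finalfit)     # missing_plot()
library(vip)          # variable importance plots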

# read in the raw survey data and standardize the column names
mental_health <- read_csv("data/mental_health.csv") %>% clean_names()
head(mental_health)

Missing Values

We can easily observe that there are missing values in our data. Let’s create a missing values map to figure out how much missing data we’re dealing with!
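
Alongside the plot, a quick numeric tally of missing values per column can also be helpful (an optional check, not part of the original analysis):

# number of NA values in each column
colSums(is.na(mental_health))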

missing_plot(mental_health)   # finalfit: map of missingness across variables

# drop the columns we will not be using (reasoning below)
mental_health <- subset(mental_health, select = -c(country, timestamp, state, work_interfere, comments, phys_health_consequence, phys_health_interview))

We see that there are missing values in state, self_employed, work_interfere, and comments. We can deal with the missing values in self_employed, but we will drop the other three columns, as a large share of their values are missing and we don't anticipate them being especially important in training our model.

Additionally, we will drop phys_health_consequence and phys_health_interview, as our focus is on mental health rather than physical health. Timestamp and country will also be dropped, as they are unlikely to be very informative.

Here is our new dataset!

head(mental_health)

Data Cleaning

Next, we move on to cleaning our data. In general, there’s not much work to be done, but notice that there are many unique values in the gender column. We see that there are variations of the words ‘male’ and ‘female’, and there are individuals who identify outside of ‘male’ or ‘female’.

Let’s list out all the unique values of gender so we can gauge what we’re dealing with.

unique(mental_health$gender)
##  [1] "Female"                                        
##  [2] "M"                                             
##  [3] "Male"                                          
##  [4] "male"                                          
##  [5] "female"                                        
##  [6] "m"                                             
##  [7] "Male-ish"                                      
##  [8] "maile"                                         
##  [9] "Trans-female"                                  
## [10] "Cis Female"                                    
## [11] "F"                                             
## [12] "something kinda male?"                         
## [13] "Cis Male"                                      
## [14] "Woman"                                         
## [15] "f"                                             
## [16] "Mal"                                           
## [17] "Male (CIS)"                                    
## [18] "queer/she/they"                                
## [19] "non-binary"                                    
## [20] "Femake"                                        
## [21] "woman"                                         
## [22] "Make"                                          
## [23] "Nah"                                           
## [24] "All"                                           
## [25] "Enby"                                          
## [26] "fluid"                                         
## [27] "Genderqueer"                                   
## [28] "Androgyne"                                     
## [29] "Agender"                                       
## [30] "cis-female/femme"                              
## [31] "Guy (-ish) ^_^"                                
## [32] "male leaning androgynous"                      
## [33] "Man"                                           
## [34] "Trans woman"                                   
## [35] "msle"                                          
## [36] "Neuter"                                        
## [37] "Female (trans)"                                
## [38] "queer"                                         
## [39] "Female (cis)"                                  
## [40] "Mail"                                          
## [41] "cis male"                                      
## [42] "A little about you"                            
## [43] "Malr"                                          
## [44] "p"                                             
## [45] "femail"                                        
## [46] "Cis Man"                                       
## [47] "ostensibly male, unsure what that really means"

Now that we have a list of gender responses, we can sort each response into "M" for male, "F" for female, and "Other" for responses that fall outside that binary. We keep these responses in a separate "Other" category rather than folding them into "M" or "F", because the proportion of this group that seeks mental health treatment tends to be larger than that of the other two groups, and keeping it distinct lets our model train on that signal more appropriately.

After we categorize the gender values, we can list out all of the new, unique values in the gender column. We see that there are only the values “M”, “F”, and “Other”.

# collapse free-text gender responses into three categories: "M", "F", "Other"
male_responses <- c("M", "Male", "male", "m", "maile", "Cis Male", "Mal", "Male (CIS)",
                    "Make", "Man", "msle", "Mail", "cis male", "Malr", "Cis Man")
female_responses <- c("Female", "female", "Cis Female", "F", "Woman", "f", "Femake",
                      "woman", "cis-female/femme", "Female (cis)", "femail")

mental_health$gender <- ifelse(mental_health$gender %in% male_responses, "M",
                        ifelse(mental_health$gender %in% female_responses, "F", "Other"))

unique(mental_health$gender)
## [1] "F"     "M"     "Other"

Now, we change all of our non-numeric variables into factors. That is, we convert all the variables but age. Keep in mind that the treatment value “No” is coded as the first level.

Here is our new dataset, with gender re-coded and the categorical variables changed into factors.

# convert every non-numeric (character) column -- i.e., everything except age -- to a factor
mental_health <- mental_health %>% 
  mutate(across(where(is.character), factor))

head(mental_health)
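
As a quick sanity check (an optional snippet, not in the original), we can confirm that "No" is indeed the first level of treatment, since factor() orders levels alphabetically by default:

levels(mental_health$treatment)   # expected: "No" "Yes"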

Exploratory Data Analysis

It is beneficial to see visualizations of our data so that we can better interpret it and gauge which predictors may be helpful in our model.

counts <- table(mental_health$treatment)
barplot(counts, xlab = 'Sought Treatment', ylab = 'Count', col = c('#eb8060', '#b9e38d'))

In the visualization above, we see counts of the participants who did and did not receive treatment. Out of the 1,259 observations, those who did and did not seek treatment are split approximately in half.

# bin ages into decade-wide groups for plotting
mh <- mental_health %>% 
  mutate(age_group = case_when(age < 20             ~ "< 20",
                               age >= 20 & age < 30 ~ "20-30",
                               age >= 30 & age < 40 ~ "30-40",
                               age >= 40 & age < 50 ~ "40-50",
                               age >= 50 & age < 60 ~ "50-60",
                               TRUE                 ~ "> 60"))
ggplot(mh, aes(age_group)) +
  geom_bar(aes(fill = treatment)) +
    scale_fill_brewer(palette="Accent") +
    theme_minimal() +
    labs(fill = "Receiving therapy \n(Yes = 1)") +
    xlab(" ") +
    ggtitle("Employees seeking treatment based on age group")

Next, we visualize the distribution of treatment based on age group. We sorted each participant into age groups in increments of ten years so that we could perform this analysis. We can observe that most of the participants are in the 30-40 age group. However, each group is split approximately in half in terms of seeking treatment, so we can expect that age will not be a largely important predictor of treatment.

ggplot(mental_health, aes(gender)) +
  geom_bar(aes(fill = treatment)) +
    scale_fill_brewer(palette="Pastel2") +
    theme_minimal() +
    labs(fill = "Receiving therapy \n(Yes = 1)") +
    xlab(" ") +
    ggtitle("Employees seeking treatment based on gender")

From the plot above, we see that most of the survey respondents are male, and very few fall into the "Other" category. However, of the three gender categories, the proportion of "Other" respondents who sought treatment is far larger than that of males or females. As such, we can presume that gender, and especially the "Other" category, will be somewhat significant in our model's predictions.

library(ggcorrplot)

mh2 <- subset(mental_health, select = c(treatment, family_history, tech_company, benefits, supervisor, coworkers, obs_consequence))
model.matrix(~0+., data=mh2) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)

This is the correlation plot for our data. Since the variables are coded as factors, each value of a predictor is treated as its own variable; for example, treatment is separated into treatmentYes and treatmentNo. This makes it easier to determine which values of each variable are correlated. In this matrix, we only compare the variables treatment, family_history, tech_company, benefits, supervisor, coworkers, and obs_consequence. From the plot, we see a strong positive correlation between a family history of mental illness and the subject seeking treatment. There is also a positive correlation between a subject seeking treatment and the employer providing mental health benefits. On the other hand, not seeking treatment is slightly correlated with not being able to discuss mental health with supervisors and coworkers, and those at tech companies also appear slightly less likely to seek mental health aid. Surprisingly, seeking treatment is positively correlated with having observed negative consequences for coworkers with mental health conditions in the workplace, which is contrary to our expectation.

Setting Up Our Models

Now that we have cleaned our data and obtained an understanding of it, we can begin setting up our models by splitting our data, creating a recipe, and setting up folds for k-fold cross-validation.

Data Split

First, we split our data into a training set and a testing set. The training set will be used to build our models. Once training is complete, we will evaluate our best-fitting models on the testing set so that we can gauge their performance on data they have not seen before.

We will be using a 70/30 split, such that the majority of our data (70%) is used for training, while an adequate amount (30%) is left over for testing. The split is stratified on treatment, the variable we want to predict, so that the distribution of the outcome variable is the same in both the training and testing data.

mental_split <- initial_split(mental_health, prop = 0.7, strata = treatment)
mental_train <- training(mental_split)
mental_test <- testing(mental_split)

Before moving on, we verify that the 70/30 split was successful.

dim(mental_health)
## [1] 1259   20
dim(mental_train)
## [1] 880  20
dim(mental_test)
## [1] 379  20
cat("Proportion of all data in training set: ", nrow(mental_train) / nrow(mental_health))
## Proportion of all data in training set:  0.6989674
cat("Proportion of all data in testing set: ", nrow(mental_test) / nrow(mental_health))
## Proportion of all data in testing set:  0.3010326

We have now confirmed that about 70% of our overall data is in the training set and about 30% of our data is in the testing set!
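
As an optional extra check (not in the original), we can also confirm that stratification preserved the outcome distribution in both sets:

# proportions of "No"/"Yes" should be nearly identical across the two sets
prop.table(table(mental_train$treatment))
prop.table(table(mental_test$treatment))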

Recipe Creation

An essential step to fitting our models is building our recipe. The recipe defines the predictors we will be using and our outcome variable. In this case, we will be using the predictors: age, gender, self_employed, family_history, no_employees, remote_work, tech_company, benefits, care_options, wellness_program, seek_help, anonymity, leave, mental_health_consequence, coworkers, supervisor, mental_health_interview, mental_vs_physical, and obs_consequence. We will be predicting the outcome variable of treatment.

Recall that when we were dealing with missing data, we dropped all columns with missing values except for self_employed. Now, self_employed is our only predictor with missing values, and we will use step_impute_mode() to fill the missing self_employed observations with the most frequent value in that column.

Since all of the predictors besides age are categorical and discrete, we will use step_dummy() to turn our nominal variables into dummy variables. We will also use step_normalize() to center and scale the predictors, which are all numeric (including age) once the dummy variables have been created.

mental_recipe <- recipe(treatment ~ age + gender + self_employed + family_history +
                          no_employees + remote_work + tech_company + benefits + care_options +
                          wellness_program + seek_help + anonymity + leave +
                          mental_health_consequence + coworkers + supervisor +
                          mental_health_interview + mental_vs_physical + obs_consequence, data =
                          mental_train) %>%
  step_impute_mode(self_employed) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors())
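
To see exactly what the recipe produces, we could optionally prep it and bake the training set (a quick inspection snippet, not part of the original workflow):

# preview the processed training data: dummy columns plus centered/scaled numerics
mental_recipe %>% prep() %>% bake(new_data = NULL) %>% head()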

K-Fold Cross Validation with Stratified Sampling

Now, we will create our 5 folds for k-fold cross-validation. In k-fold cross-validation, each fold is used as the assessment set once, with the other k-1 folds used for training. Performance is then averaged across the folds, which gives a more stable estimate than a single validation split because it is based on several resamples rather than one. We also stratify the folds on the outcome variable so that the outcome is balanced across folds.

Below, we can see that our five folds were successfully created.

cv <- vfold_cv(mental_train, v = 5, strata = "treatment")
cv
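
If we wanted to verify the stratification within the folds, one option (an optional check, not in the original) is to inspect the outcome balance in a single fold's analysis set:

# outcome counts in the analysis (training) portion of the first fold
cv$splits[[1]] %>% analysis() %>% count(treatment)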

Fitting Our Models

Now, we will be building our models! I will be testing logistic regression, k nearest neighbors, elastic net, random forest, and gradient-boosted trees. To evaluate model performance, I will be using ROC AUC. The ROC AUC measures the area under the ROC curve, that is, it judges a model’s ability to discriminate between classes (in this case, it is classes of “Yes” and “No” for the outcome variable). It measures overall performance of the classifier and is summarized over all possible thresholds. The higher the ROC AUC, the better the model.
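
As a toy illustration of how this metric is computed with yardstick (not part of the analysis; the tibble below is made up), roc_auc() takes the true classes and the predicted probability of the first level:

# perfectly separated toy predictions give an ROC AUC of 1
toy <- tibble(truth = factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes")),
              .pred_No = c(0.9, 0.6, 0.4, 0.2))
roc_auc(toy, truth, .pred_No)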

Set Up Models

We will be specifying the models to be fitted (logistic regression, k nearest neighbors, elastic net, random forest, and gradient-boosted trees), the parameters we will be tuning, and each model’s engine and mode. Please note that the mode for all of the models will be ‘classification’, as our outcome variable is discrete, either sorted into “Yes” or “No”.

# Logistic Regression
# No tuning will be done
log_mod <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

# K Nearest Neighbors
# Tuning number of neighbors
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_engine('kknn') %>%
  set_mode('classification')


# Elastic Net
# Tuning mixture and penalty
net_mod <- logistic_reg(mixture = tune(),
                              penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# Random Forest
# Tuning mtry, trees, and min_n
rf_mod <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Gradient-Boosted Trees
# Tuning mtry, trees, and learn_rate
bt_mod <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("classification")

Structure Workflow

Now, we build the workflows for each of our models. In our workflows, we specify the model as well as the recipe we made previously.

# Logistic Regression
log_wkflow <- workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(mental_recipe)

# K Nearest Neighbors
knn_wkflow <- workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(mental_recipe)

# Elastic Net
net_wkflow <- workflow() %>%
  add_model(net_mod) %>%
  add_recipe(mental_recipe)

# Random Forest
rf_wkflow <- workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(mental_recipe)

# Gradient-Boosted Trees
bt_wkflow <- workflow() %>% 
  add_model(bt_mod) %>% 
  add_recipe(mental_recipe)

Grid Building

Building a grid for each model lets us dictate which values the tuned parameters can take on. That is, we specify the range of each parameter being tuned, as well as how many levels to try within that range.

knn_grid <- grid_regular(neighbors(range = c(1,9)),
                         levels = 10)

net_grid <- grid_regular(penalty(),
                        mixture(range = c(0, 1)),
                             levels = 10)

rf_grid <- grid_regular(mtry(range = c(1, 19)),
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 8)

bt_grid <- grid_regular(mtry(range = c(1, 19)),
                        trees(range = c(200, 600)),
                        learn_rate(range = c(-10, -1)),
                        levels = 8)

Model Tuning

Now, we can tune all of our models using our pre-defined workflows, cross-validation folds, and tuning grids! Please note that we will not be tuning the logistic regression model, as we did not specify any parameters to tune. These models will take a considerable amount of time to tune, so we will save our tuned models to separate RDS files so that we can avoid tuning our models each time.

knn_tune <- tune_grid(object = knn_wkflow,
                      resamples = cv,
                      grid = knn_grid)

net_tune <- tune_grid(object = net_wkflow,
                      resamples = cv,
                      grid = net_grid)

rf_tune <- tune_grid(object = rf_wkflow,
                     resamples = cv,
                     grid = rf_grid)

bt_tune <- tune_grid(object = bt_wkflow,
                     resamples = cv,
                     grid = bt_grid)

write_rds(knn_tune, file = "tuned_models/knn_tune.rds")
write_rds(net_tune, file = "tuned_models/net_tune.rds")
write_rds(rf_tune, file = "tuned_models/rf_tune.rds")
write_rds(bt_tune, file = "tuned_models/bt_tune.rds")

Load in Tuned Models

Here, we load back in our tuned models from the RDS files so that we can work with them more efficiently!

knn_tune <- read_rds(file = "tuned_models/knn_tune.rds")
net_tune <- read_rds(file = "tuned_models/net_tune.rds")
rf_tune <- read_rds(file = "tuned_models/rf_tune.rds")
bt_tune <- read_rds(file = "tuned_models/bt_tune.rds")

Model Plots

Now, we will be using the autoplot() function in R to visualize how changing each parameter affects the performance of the models. This performance is measured by ROC AUC, as mentioned before. There will not be an autoplot for our logistic regression model because we did not tune any parameters for it.

knn_tune %>% autoplot(metric = 'roc_auc')

In our k nearest neighbors model, we see that there is a clear trend between number of neighbors and model performance. That is, the greater the number of neighbors, the higher the ROC AUC, and thus, the better the model performance.

net_tune %>% autoplot(metric = 'roc_auc')

In our elastic net model, we tuned the amount of regularization (penalty) and the proportion of lasso penalty (mixture). Most combinations performed roughly the same throughout, with the exception of a brief spike when the penalty value was slightly greater than 1e-02; within this spike, smaller mixture values performed better. As the penalty grows larger past this point, there is a large drop in performance. Thus, we can expect our elastic net model to perform best when its penalty value is slightly greater than 1e-02 and its mixture value is small.

rf_tune %>% autoplot(metric = 'roc_auc')

In our random forest model, we tuned the minimal node size (min_n), the number of randomly selected predictors (mtry), and the number of trees (trees). Each plot follows the same relative trend: performance starts low for a small number of randomly selected predictors, rises to peak performance between mtry values of roughly three and seven, then steadily decreases as mtry becomes larger. Minimal node size does not seem to have much effect, and it is difficult to identify a trend based on the number of trees, as the patterns differ from plot to plot with no consistency; moreover, the lines representing trees overlap quite a bit, indicating that trees has little effect either. From these plots, the highest point reached occurs when the minimal node size is 10, the number of randomly selected predictors is around 6, and the number of trees is 200, peaking at an ROC AUC slightly above 0.770.

bt_tune %>% autoplot(metric = 'roc_auc')

In our gradient-boosted trees model, we tuned the learning rate (learn_rate), the number of randomly selected predictors (mtry), and the number of trees (trees). In the first plot, which displays a learning rate of 1e-10, the model performed at 0.5 throughout, regardless of the values of mtry and trees. However, once the learning rate becomes slightly larger, model performance increases to over 0.7 with a higher number of predictors and more trees. Our model performance stays relatively consistent until the plot with a learning rate of 0.1, where the ROC AUC begins to fall, with performance decreasing as the number of predictors and the number of trees increase.

Model Results

Now, we can display the results for each combination of our tuned parameters. We also need to fit our logistic regression model, since it was not tuned with the other models! We will be judging the models based on their mean ROC AUC.

All results for k nearest neighbors:

log_fit <- fit_resamples(log_wkflow, resamples = cv)
log_metrics <- collect_metrics(log_fit)

knn_metrics <- collect_metrics(knn_tune)
net_metrics <- collect_metrics(net_tune)
rf_metrics <- collect_metrics(rf_tune)
bt_metrics <- collect_metrics(bt_tune)

knn_metrics[knn_metrics$.metric == 'roc_auc',]

All results for logistic regression:

Note: There is only one model for logistic regression, because we did not tune any parameters.

log_metrics[log_metrics$.metric == 'roc_auc',]

All results for elastic net:

net_metrics[net_metrics$.metric == 'roc_auc',]

All results for random forest:

rf_metrics[rf_metrics$.metric == 'roc_auc',]

All results for gradient-boosted trees:

bt_metrics[bt_metrics$.metric == 'roc_auc',]

Here are the best models for each type of machine learning model we fitted, as well as their ROC AUC. Overall, our models did a decent job!

# select the best configuration of each model by ROC AUC
log_best <- select_best(log_fit, metric = 'roc_auc')
knn_best <- select_best(knn_tune, metric = 'roc_auc')
net_best <- select_best(net_tune, metric = 'roc_auc')
rf_best <- select_best(rf_tune, metric = 'roc_auc')
bt_best <- select_best(bt_tune, metric = 'roc_auc')

best_log_model <- log_metrics[log_metrics$.metric == 'roc_auc' & log_metrics$.config == log_best$.config,]
best_knn_model <- knn_metrics[knn_metrics$.metric == 'roc_auc' & knn_metrics$.config == knn_best$.config,]
best_net_model <- net_metrics[net_metrics$.metric == 'roc_auc' & net_metrics$.config == net_best$.config,]
best_rf_model <- rf_metrics[rf_metrics$.metric == 'roc_auc' & rf_metrics$.config == rf_best$.config,]
best_bt_model <- bt_metrics[bt_metrics$.metric == 'roc_auc' & bt_metrics$.config == bt_best$.config,]

roc_comparisons <- tibble(Model = c("Logistic Regression", "K Nearest Neighbors", "Elastic Net", "Random Forest", "Boosted Trees"), Specific_Model = c(log_best$.config, knn_best$.config,net_best$.config, rf_best$.config, bt_best$.config),
                          ROC_AUC = c(best_log_model$mean, best_knn_model$mean, best_net_model$mean, best_rf_model$mean, best_bt_model$mean))

roc_comparisons
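
Equivalently, show_best() offers a more concise way to peek at the single top configuration of a tuning result (shown here for the boosted trees as an optional aside):

show_best(bt_tune, metric = "roc_auc", n = 1)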

Best Models

From the table above, we see that Boosted Trees Model 211 performed the best (ROC AUC: 0.7714193), with Random Forest Model 3 coming in a close second (ROC AUC: 0.7707865).

best_bt_model

The best boosted trees model, and best model overall, was model 211 with a learning rate of 3.727594e-08, 314 trees, and eight random predictors.

best_rf_model

The second-best model was random forest #3, with a minimal node size of 10, 6 randomly selected predictors, and 200 trees.

Fitting to Training Data

Now that we have decided our best models, we will restructure our workflows to encapsulate the tuned variables and re-fit the models on the training data.

# Boosted Trees
final_bt_wkflow <- finalize_workflow(bt_wkflow, bt_best)
final_bt_fit <- fit(final_bt_wkflow, data = mental_train)

# Random Forest
final_rf_wkflow <- finalize_workflow(rf_wkflow, rf_best)
final_rf_fit <- fit(final_rf_wkflow, data = mental_train)

write_rds(final_bt_fit, file = "training_fit/final_bt_fit.rds")
write_rds(final_rf_fit, file = "training_fit/final_rf_fit.rds")

Below are the Variable Importance Plots (VIP) for the two best-performing models. VIPs show which predictors were most important when predicting the outcome variable of treatment.

final_bt_fit <- read_rds(file = "training_fit/final_bt_fit.rds")
final_rf_fit <- read_rds(file = "training_fit/final_rf_fit.rds")

final_bt_fit %>% extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

final_rf_fit %>% extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

For both models, family_history turned out to be the predictor that mattered most in determining whether a working individual will seek treatment. It was surprising that age ranked towards the top for both models: based on the exploratory data analysis we performed at the beginning of the project, we expected that age would not matter much, since each age group seemed to be split fairly evenly between the two values of treatment.

Testing Best Models

Finally, we will test our models by evaluating them on the testing data! This way, we can observe how our models perform on data they have never seen before.

final_bt_fit <- read_rds(file = "training_fit/final_bt_fit.rds")
final_rf_fit <- read_rds(file = "training_fit/final_rf_fit.rds")

final_bt_test <- augment(final_bt_fit, 
                               mental_test) %>% 
   roc_auc(treatment, .pred_No)

final_rf_test <- augment(final_rf_fit, 
                               mental_test) %>%
  roc_auc(treatment, .pred_No)

final_bt_test
final_rf_test

From the ROC AUC values, we see that both the gradient-boosted trees and random forest models performed even better on the testing data than they did during cross-validation, with boosted trees still performing best! The ROC AUC values of 0.7843555 and 0.7769886 show that our models performed relatively well!

Conclusion

Through this project, we were able to shed some light on how an individual's personal traits and work environment may affect the likelihood of them seeking help for mental health issues. To do this, we explored our data, cleaning and analyzing it so that we could better understand it and make it easier for our machine learning models to use. Next, we split our data and fit multiple models to the training data, incorporating hyperparameter tuning and comparing ROC AUC to measure each model's performance and find the model that fit our data best.

We were able to conclude that, although all the models did decently well in predicting our outcome variable, gradient-boosted trees and random forest came out on top. This may be due to the fact that both random forest and gradient-boosted trees are capable of capturing non-linear relationships between features and the target variable, which may have allowed them to identify complex patterns in our dataset.

Our worst-performing model was k nearest neighbors, with an ROC AUC of 0.6789875. K nearest neighbors classifies each observation using the data points closest to it, so if an observation has predictor values the model has not seen before, there is little nearby data on which to base a prediction. As a result, k nearest neighbors tends to work poorly on datasets with many predictors. In our case, a large number of predictors (19) were used, which makes it unsurprising that k nearest neighbors performed the worst.

Although a lot was learned from this project, there are still loose ends that further exploration could tie up. For example, in the visualizations of age, we saw that those who sought treatment in each age group were fairly evenly distributed between "Yes" and "No", so we believed that age would not end up being an important predictor. However, the VIPs for our best-performing models show that age was one of the variables that mattered more for both models. If I were to continue with this project, it would be fascinating to delve into why this may be.

Overall, I am grateful to have had the opportunity to combine my passion for psychology with everything I have learned about machine learning. This project has truly shown me that there is so much we can discover about each other and our surroundings through data, and with developing technology, there are many ways in which amateur data scientists like myself can make a difference. Perhaps with more complex explorations in this subject, we may be able to make mental health a comfortable topic in the workplace and bring working individuals the support they need.